# 3D Reconstruction

LHM
LHM (Large Animatable Human Reconstruction Model) uses a multimodal transformer architecture for high-fidelity 3D human reconstruction, generating animatable 3D human characters from a single image. The model accurately preserves clothing geometry and texture, and is particularly strong at restoring facial identity and detail, making it suitable for applications with strict requirements on 3D reconstruction accuracy.
3D Modeling
59.9K

Prompt Depth Anything
Prompt Depth Anything is a method for high-resolution, high-precision depth estimation. It unlocks the potential of depth foundation models through prompting, using low-cost iPhone LiDAR as the prompt to guide the model toward precise depth output at up to 4K resolution. It also introduces a scalable data pipeline for training and releases a ScanNet++ dataset with more detailed depth annotations. The main advantages are high-resolution, high-precision depth estimation and benefits for downstream applications such as 3D reconstruction and generalized robotic grasping (a rough sketch of the LiDAR-prompting idea follows this entry).
3D Modeling
48.0K
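
The core idea described above is to treat a low-resolution LiDAR depth map as a prompt that conditions a depth model. Below is a minimal conceptual sketch of that fusion, not the released implementation: `PromptedDepthHead`, its layer sizes, and the concatenation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedDepthHead(nn.Module):
    """Toy depth head that fuses image features with an upsampled LiDAR prompt.

    Hypothetical illustration of "prompting" a depth model with coarse LiDAR;
    the real Prompt Depth Anything architecture differs.
    """
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(feat_dim + 1, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, 1, 3, padding=1),
        )

    def forward(self, image_feats: torch.Tensor, lidar_depth: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse LiDAR depth to the feature resolution and
        # concatenate it as an extra channel -- the "prompt".
        prompt = F.interpolate(lidar_depth, size=image_feats.shape[-2:],
                               mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([image_feats, prompt], dim=1))

# Toy usage with random tensors standing in for backbone features and iPhone LiDAR.
feats = torch.randn(1, 64, 480, 640)        # features from a depth foundation model
lidar = torch.rand(1, 1, 192, 256) * 5.0    # coarse depth prompt (metres)
depth = PromptedDepthHead()(feats, lidar)   # dense depth at feature resolution
print(depth.shape)                          # torch.Size([1, 1, 480, 640])
```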

PSHuman
PSHuman is an innovative framework that uses multi-view diffusion models and explicit reconstruction techniques to recover realistic 3D human models from a single image. Its significance lies in handling severe self-occlusion and avoiding geometric distortion in the generated facial details. By jointly modeling the global body shape and local facial characteristics with a cross-scale diffusion model, PSHuman produces richly detailed novel views while preserving identity. It also improves cross-view body-shape consistency under different poses using the body priors provided by parametric models such as SMPL-X. Key advantages of PSHuman include rich geometric detail, high texture fidelity, and strong generalization.
3D Modeling
75.1K

CAT4D
CAT4D is a cutting-edge technology that creates 4D scenes from monocular video using multi-view video diffusion models. It transforms an input monocular video into multi-view videos and reconstructs a dynamic 3D scene from them. The significance of this technology lies in its ability to recover complete spatial and temporal information from single-view footage, providing solid technical support for virtual reality, augmented reality, and 3D modeling. CAT4D is a collaborative project by researchers from Google DeepMind, Columbia University, and UC San Diego, and represents a successful case of turning advanced research into a practical application.
3D Modeling
60.7K

LucidFusion
LucidFusion is a flexible end-to-end feedforward framework for generating high-resolution 3D Gaussians from unposed, sparse, arbitrarily many multi-view images. It uses a Relative Coordinate Map (RCM) to align geometric features across views, giving the 3D generation a high degree of adaptability (a rough illustration of the RCM idea follows this entry). LucidFusion integrates seamlessly with traditional single-image-to-3D pipelines, producing detailed 3D Gaussians at 512x512 resolution suitable for a wide range of applications.
3D Modeling
51.1K
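
As a rough illustration of the Relative Coordinate Map idea, the sketch below expresses per-pixel 3D points from a view in a single reference camera's coordinate frame, so that geometry from different views lives in one shared space. The function name and the use of known poses here are illustrative assumptions, not LucidFusion's code (which works from unposed inputs).

```python
import torch

def relative_coordinate_map(points_cam: torch.Tensor,
                            cam_to_world: torch.Tensor,
                            ref_world_to_cam: torch.Tensor) -> torch.Tensor:
    """Map per-pixel 3D points of one view into the reference camera frame.

    points_cam:        (H, W, 3) points in this view's camera coordinates
    cam_to_world:      (4, 4) pose of this view
    ref_world_to_cam:  (4, 4) inverse pose of the reference view
    Returns an (H, W, 3) "relative coordinate map" in the reference frame.
    """
    H, W, _ = points_cam.shape
    pts_h = torch.cat([points_cam.reshape(-1, 3),
                       torch.ones(H * W, 1)], dim=1)          # homogeneous coordinates
    pts_ref = (ref_world_to_cam @ cam_to_world @ pts_h.T).T   # chain the two transforms
    return pts_ref[:, :3].reshape(H, W, 3)

# Toy usage: with identity poses the RCM equals the input points.
pts = torch.randn(32, 32, 3)
rcm = relative_coordinate_map(pts, torch.eye(4), torch.eye(4))
print(torch.allclose(rcm, pts))  # True
```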

Long-LRM
Long-LRM is a 3D Gaussian reconstruction model that recovers large scenes from a sequence of input images. It can process 32 images at 960x540 resolution in just 1.3 seconds on a single A100 80G GPU. The architecture mixes recent Mamba2 blocks with traditional transformer blocks and applies effective token merging and Gaussian pruning to improve efficiency without compromising quality (a schematic of the token-merging idea follows this entry). Unlike traditional feedforward models that can only reconstruct a small portion of a scene, Long-LRM regenerates the entire scene in a single pass. On large-scale scene datasets such as DL3DV-140 and Tanks and Temples, Long-LRM matches optimization-based methods while being two orders of magnitude more efficient.
3D Modeling
53.8K
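
The description mentions token merging as one way Long-LRM keeps long input sequences tractable. The snippet below is only a schematic of that general idea (average-pooling neighbouring tokens between blocks to shrink the sequence); the real model's Mamba2/transformer block mix and its exact merging rule are not reproduced here.

```python
import torch
import torch.nn as nn

def merge_tokens(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Average-pool every `factor` neighbouring tokens, shrinking the sequence."""
    B, N, C = x.shape
    N = (N // factor) * factor                     # drop a ragged tail if any
    return x[:, :N].reshape(B, N // factor, factor, C).mean(dim=2)

# Toy stack: one transformer block, a merge step, then another block on fewer tokens.
block_a = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
block_b = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)

tokens = torch.randn(1, 2048, 256)     # stand-in for patch tokens from many frames
tokens = block_a(tokens)
tokens = merge_tokens(tokens, factor=4)  # 4x fewer tokens for the later blocks
tokens = block_b(tokens)
print(tokens.shape)                      # torch.Size([1, 512, 256])
```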

Flex3D
Flex3D features a two-stage process that generates high-quality 3D assets from a single image or text prompt. This technology represents the latest advancements in the field of 3D reconstruction, significantly improving the efficiency and quality of 3D content generation. Flex3D is developed with support from Meta, and the team has a strong background in 3D reconstruction and computer vision.
AI 3D tools
63.2K

ViewCrafter
ViewCrafter is an innovative approach that leverages the generative capabilities of video diffusion models and the coarse 3D cues provided by point-based representations to synthesize high-fidelity new viewpoints of general scenes from single or sparse images. The method progressively expands the area covered by 3D cues and new viewpoints through iterative view synthesis strategies and camera trajectory planning algorithms, thereby increasing the generation range of new viewpoints. ViewCrafter can facilitate various applications, such as creating immersive experiences and real-time rendering by optimizing 3D-GS representations, as well as promoting imaginative content creation through scene-level text-to-3D generation.
AI image generation
59.6K

OmniRe
OmniRe is a comprehensive method for efficiently reconstructing high-fidelity dynamic urban scenes from device logs. It builds a dynamic neural scene graph based on Gaussian representations and constructs multiple local canonical spaces to model various dynamic actors, including vehicles, pedestrians, and cyclists, enabling complete reconstruction of the different objects in a scene and real-time simulation of the reconstructed scene with all participants (a schematic of the scene-graph composition follows this entry). Extensive evaluation on the Waymo dataset shows that OmniRe significantly outperforms previous state-of-the-art methods both quantitatively and qualitatively.
AI image generation
51.6K
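
A rough sketch of the scene-graph idea described above: each dynamic actor keeps Gaussians in its own local canonical space, and at every timestep they are rigidly posed into the world and composed with the static background. Class and field names are invented for illustration; this is not OmniRe's code.

```python
from dataclasses import dataclass
import torch

@dataclass
class ActorNode:
    """A dynamic actor: Gaussian centres in a local canonical space + per-frame poses."""
    canonical_means: torch.Tensor          # (N, 3) Gaussian centres in canonical space
    poses: torch.Tensor                    # (T, 4, 4) canonical-to-world pose per frame

def compose_frame(static_means: torch.Tensor,
                  actors: list[ActorNode],
                  t: int) -> torch.Tensor:
    """Pose each actor's canonical Gaussians into the world at frame t and
    concatenate them with the static background Gaussians."""
    parts = [static_means]
    for actor in actors:
        R, trans = actor.poses[t, :3, :3], actor.poses[t, :3, 3]
        parts.append(actor.canonical_means @ R.T + trans)
    return torch.cat(parts, dim=0)         # all Gaussian centres for frame t

# Toy usage: one static background and one moving "vehicle".
static = torch.randn(1000, 3)
vehicle = ActorNode(canonical_means=torch.randn(200, 3),
                    poses=torch.eye(4).repeat(10, 1, 1))
vehicle.poses[:, 0, 3] = torch.linspace(0, 5, 10)   # translate along x over 10 frames
print(compose_frame(static, [vehicle], t=3).shape)  # torch.Size([1200, 3])
```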

GenWarp
GenWarp is a model for generating novel-viewpoint images from a single image. It employs a semantic-preserving generative warping framework that lets a text-to-image generation model learn where to warp and where to generate. The model addresses the limitations of existing methods by augmenting cross-view attention with self-attention, conditioning the generative model on the source-view image, and incorporating geometric warping signals, improving performance on both in-domain and out-of-domain scenes.
AI Image Generation
67.3K

Ouroboros3D
Ouroboros3D is a unified 3D generation framework that integrates multi-view image generation and 3D reconstruction into a single recursive diffusion process. The two modules are jointly trained through a self-conditioning mechanism so that they adapt to each other for robust inference. During multi-view denoising, the multi-view diffusion model uses 3D-aware images rendered by the reconstruction module at the previous timestep as additional conditions (the denoise-reconstruct-render loop is sketched after this entry). Combining the recursive diffusion framework with this 3D-aware feedback improves the geometric consistency of the whole process. Experiments show that Ouroboros3D outperforms both training the two stages separately and existing methods that only combine them at inference time.
AI image generation
63.8K
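
The denoise-reconstruct-render loop described above can be summarised in pseudocode. The sketch below uses placeholder functions (`denoise_step`, `reconstruct_3d`, `render_views`) as stand-ins for the multi-view diffusion model and the reconstruction module; it illustrates the feedback structure only, not the actual implementation.

```python
import torch

def denoise_step(noisy_views, cond_renders, t):
    """Placeholder for one multi-view denoising step conditioned on
    3D-aware renders from the previous timestep."""
    return noisy_views - 0.1 * (noisy_views - cond_renders)

def reconstruct_3d(views):
    """Placeholder for the feedforward 3D reconstruction module."""
    return {"gaussians": views.mean(dim=0)}          # stand-in "3D representation"

def render_views(scene, num_views):
    """Placeholder for rendering the current 3D estimate back into each view."""
    return scene["gaussians"].unsqueeze(0).repeat(num_views, 1, 1, 1)

# Recursive diffusion: every denoising step is conditioned on renders of the
# 3D scene reconstructed from the previous step's estimate.
num_views, steps = 4, 20
views = torch.randn(num_views, 3, 64, 64)            # start from noise
renders = torch.zeros_like(views)                     # no feedback at the first step
for t in reversed(range(steps)):
    views = denoise_step(views, renders, t)
    scene = reconstruct_3d(views)                     # 3D-aware feedback
    renders = render_views(scene, num_views)
print(views.shape, renders.shape)
```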

ViewDiff
ViewDiff is a method for generating multi-view consistent images from real-world data by leveraging pre-trained text-to-image models as prior knowledge. It incorporates 3D volume rendering and cross-frame attention layers into the U-Net network, enabling the generation of 3D-consistent images in a single denoising process. Compared to existing methods, ViewDiff generates results with better visual quality and 3D consistency.
AI image generation
85.0K

DiffHuman
DiffHuman is a probabilistic method for photorealistic 3D human reconstruction. From a single RGB image it predicts a probability distribution over 3D reconstructions and, through iterative denoising sampling, generates multiple detailed and colorful 3D human models. Compared with existing deterministic methods, DiffHuman produces more detailed reconstructions in unobserved or uncertain regions. It also introduces a generator network that approximates rendering, significantly speeding up inference.
AI image generation
61.5K

GRM
GRM is a large-scale reconstruction model that can recover 3D assets from sparse-view images in 0.1 seconds and complete generation in 8 seconds. It is a feedforward transformer-based model that efficiently fuses multi-view information to translate input pixels into pixel-aligned Gaussians, which are unprojected into a dense set of 3D Gaussians representing the scene (the pixel-aligned unprojection is sketched after this entry). The transformer architecture and the use of 3D Gaussians together unlock a scalable and efficient reconstruction framework. Extensive experiments show that GRM surpasses alternatives in both reconstruction quality and efficiency. GRM also shows potential in generation tasks such as text-to-3D and image-to-3D when combined with existing multi-view diffusion models.
AI image generation
58.8K
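
The description mentions translating input pixels into pixel-aligned Gaussians that are unprojected into 3D. The sketch below shows only that geometric step, unprojecting a per-pixel depth prediction along camera rays to place one Gaussian centre per pixel; the function name and pinhole intrinsics are illustrative assumptions, not GRM's code.

```python
import torch

def unproject_pixel_gaussians(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Place one Gaussian centre per pixel by unprojecting predicted depth.

    depth: (H, W) per-pixel depth predicted alongside other Gaussian parameters
    K:     (3, 3) pinhole camera intrinsics
    Returns (H*W, 3) Gaussian centres in camera coordinates.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).reshape(-1, 3)  # homogeneous pixels
    rays = (torch.linalg.inv(K) @ pix.T).T                                # camera rays
    return rays * depth.reshape(-1, 1)                                    # scale by depth

# Toy usage: a flat depth map at 2 m with simple intrinsics.
K = torch.tensor([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
centres = unproject_pixel_gaussians(torch.full((64, 64), 2.0), K)
print(centres.shape)  # torch.Size([4096, 3])
```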

TripoSR
TripoSR, a 3D object reconstruction model developed by Stability AI and Tripo AI, can generate a high-quality 3D model from a single image in under a second. The model runs with a low inference budget and does not require a GPU, making it accessible to a wide range of users and application scenarios. The model weights and source code are released under the MIT license, allowing commercial, personal, and research use.
AI Model
415.7K

DUSt3R
DUSt3R is a novel dense, unconstrained stereo 3D reconstruction method that works on arbitrary image collections. It requires no prior knowledge of camera calibration or viewpoint poses. By casting the pairwise reconstruction problem as pointmap regression, DUSt3R relaxes the hard constraints of traditional projective camera models. It handles monocular and binocular reconstruction in a unified way and proposes a simple, effective global alignment strategy for the multi-image case (the alignment objective is sketched after this entry). The network architecture is built on standard transformer encoders and decoders, leveraging the power of pre-trained models. DUSt3R directly provides a 3D model and depth information for the scene, and from its output one can recover pixel-wise matches as well as relative and absolute camera parameters.
3D Modeling
362.4K
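
The global alignment stage mentioned above can be thought of as optimizing per-view transforms so that pairwise reconstructions, once moved into a common frame, agree on the same 3D points. The sketch below is a heavily simplified version of that objective (scale and translation only, known correspondences, one view fixed as anchor); it is a conceptual illustration, not the DUSt3R repository's optimizer.

```python
import torch

# Two "pairwise pointmaps" of the same 100 scene points, each expressed in its own
# arbitrary frame and scale, as pairwise reconstructions are before alignment.
world = torch.randn(100, 3)
view_a = 1.7 * world + torch.tensor([0.5, -0.2, 1.0])
view_b = 0.6 * world + torch.tensor([-1.0, 0.3, 2.0])

# Align view B onto view A by optimizing a scale and translation for B only
# (view A is kept fixed as the anchor, which avoids the trivial all-zero solution).
scale_b = torch.ones(1, requires_grad=True)
trans_b = torch.zeros(3, requires_grad=True)
opt = torch.optim.Adam([scale_b, trans_b], lr=0.05)

for step in range(800):
    b_aligned = scale_b * view_b + trans_b
    loss = ((b_aligned - view_a) ** 2).mean()   # corresponding points should coincide
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final alignment residual: {loss.item():.6f}")
```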

VisFusion
VisFusion is a technology for online 3D scene reconstruction from video. It can extract and reconstruct a three-dimensional environment in real time from video footage. The technology combines computer vision and deep learning to give users a powerful tool for creating precise 3D models.
AI 3D tools
65.4K

PRISMA
PRISMA is a computational photography pipeline that performs a variety of inferences from any image or video. Just as a prism refracts light into different wavelengths, the pipeline expands an image into data usable for 3D reconstruction or real-time post-processing. It integrates various algorithms and open-source pre-trained models, such as monocular depth (MiDaS v3.1, ZoeDepth, Marigold, PatchFusion), optical flow (RAFT), segmentation masks (mmdet), and camera pose estimation (colmap). Results are stored in a folder named after the input file, with each band saved as a separate .png or .mp4 file (the output layout is sketched after this entry). For videos, the final step attempts a sparse reconstruction, which can be used for NeRF training (such as NVIDIA's Instant-ngp) or Gaussian splatting training. Inferred depth is exported by default as heatmaps that can be decoded in real time with LYGIA's heatmap GLSL/HLSL sampler, and optical flow is encoded as hue (angle) and saturation, which can likewise be decoded in real time with LYGIA's optical flow GLSL/HLSL sampler.
AI Image Generation
59.1K
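
To make the band-per-file layout described above concrete, here is a small sketch of how such a pipeline might lay out its outputs: a folder named after the input, one file per inferred band. The band names and save routine are illustrative assumptions; consult the PRISMA repository for the actual formats.

```python
from pathlib import Path
import numpy as np
from PIL import Image

def export_bands(input_path: str, bands: dict[str, np.ndarray]) -> Path:
    """Write each inferred band to <input_stem>/<band>.png, mirroring the
    folder-per-input, file-per-band layout described above."""
    out_dir = Path(Path(input_path).stem)
    out_dir.mkdir(exist_ok=True)
    for name, data in bands.items():
        # Normalize to 8-bit for a quick preview; a real pipeline keeps full precision.
        arr = ((data - data.min()) / (data.max() - data.min() + 1e-8) * 255).astype(np.uint8)
        Image.fromarray(arr).save(out_dir / f"{name}.png")
    return out_dir

# Toy usage with random arrays standing in for depth and a segmentation mask.
h, w = 120, 160
out = export_bands("example.jpg", {
    "depth": np.random.rand(h, w).astype(np.float32),
    "mask": (np.random.rand(h, w) > 0.5).astype(np.float32),
})
print(sorted(p.name for p in out.iterdir()))  # ['depth.png', 'mask.png']
```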

SIFU
SIFU is a method for reconstructing high-quality 3D clothed human models from a single image. Its core innovation is a side-view conditioned implicit function that enhances feature extraction and improves geometric accuracy. Additionally, SIFU introduces a 3D-consistent texture refinement process that significantly enhances texture quality and enables texture editing through a text-to-image diffusion model. SIFU excels at handling complex poses and loose clothing, making it an ideal solution for practical applications.
AI image generation
69.6K

Repaint123
Repaint123 can generate high-quality, multi-view consistent 3D content from a single image in just 2 minutes. It combines the robust image generation capability of 2D diffusion models with the texture-alignment ability of a progressive repainting strategy to create high-quality, visually consistent multi-view images. It further improves image quality during repainting by adjusting the repainting strength according to visibility (a sketch of the visibility-aware strength idea follows this entry), making it possible to generate high-quality, multi-view consistent images with a simple mean squared error loss. The resulting images then enable rapid 3D content generation.
AI image generation
62.9K
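
A rough sketch of the visibility-aware idea mentioned above: regions seen at a grazing angle from the reference view (low visibility) receive a higher repainting strength than regions seen head-on. The cosine-based mapping below is an illustrative assumption, not the paper's exact formulation.

```python
import torch

def repaint_strength(normals: torch.Tensor,
                     view_dir: torch.Tensor,
                     min_strength: float = 0.2,
                     max_strength: float = 0.8) -> torch.Tensor:
    """Map per-pixel visibility (cosine between surface normal and view direction)
    to a per-pixel repainting strength: poorly visible pixels are repainted more
    aggressively, well-observed pixels are mostly preserved."""
    visibility = (normals * view_dir).sum(dim=-1).clamp(min=0.0)   # (H, W) in [0, 1]
    return max_strength - (max_strength - min_strength) * visibility

# Toy usage: random unit normals, camera looking down -z.
normals = torch.nn.functional.normalize(torch.randn(64, 64, 3), dim=-1)
strength = repaint_strength(normals, torch.tensor([0.0, 0.0, -1.0]))
print(strength.min().item(), strength.max().item())   # values stay within [0.2, 0.8]
```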

Human101
Human101 is a framework for quickly reconstructing human figures from a single view. It can train a 3D Gaussian model within 100 seconds and render 1024-resolution images at over 60 FPS without pre-storing per-frame Gaussian properties. The Human101 pipeline is as follows: first, extract the 2D human pose from the single-view video; then use the pose to drive a 3D simulator and generate the corresponding 3D skeletal animation; finally, build a time-varying 3D Gaussian model from the animation for real-time rendering (a skeleton of this pipeline is sketched after this entry).
AI image generation
105.2K
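
The three steps listed above can be summarised as a pipeline skeleton. Every function here is a placeholder standing in for the corresponding stage (pose extraction, skeletal animation, Gaussian model construction); it shows the data flow only, not Human101's implementation.

```python
import torch

def extract_2d_pose(video: torch.Tensor) -> torch.Tensor:
    """Placeholder: per-frame 2D joint locations from the single-view video."""
    T = video.shape[0]
    return torch.rand(T, 24, 2)                       # (frames, joints, xy)

def drive_skeletal_animation(pose_2d: torch.Tensor) -> torch.Tensor:
    """Placeholder: lift the 2D poses into a 3D skeletal animation."""
    T, J, _ = pose_2d.shape
    return torch.rand(T, J, 3)                        # (frames, joints, xyz)

def build_time_varying_gaussians(animation: torch.Tensor) -> dict:
    """Placeholder: fit a time-varying 3D Gaussian model driven by the skeleton."""
    return {"means": torch.rand(animation.shape[0], 10_000, 3)}

# Pipeline: single-view video -> 2D pose -> 3D skeletal animation -> Gaussian model.
video = torch.rand(30, 3, 512, 512)                   # 30 frames of a single view
gaussians = build_time_varying_gaussians(drive_skeletal_animation(extract_2d_pose(video)))
print(gaussians["means"].shape)                       # torch.Size([30, 10000, 3])
```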

Gaussian SLAM
Gaussian SLAM reconstructs renderable 3D scenes from RGBD data streams and is the first neural RGBD SLAM method able to reconstruct real-world scenes with photorealistic fidelity. It uses 3D Gaussians as the primary unit of scene representation, overcoming the limitations of previous methods. Traditional 3D Gaussians are difficult to use in a monocular setting: they fail to encode accurate geometric information and are hard to optimize sequentially under single-view supervision. By extending traditional 3D Gaussians to encode geometry and designing a novel scene representation together with procedures for its growth and optimization, the method yields a SLAM system that can reconstruct and render real-world datasets while maintaining speed and efficiency (the track-grow-optimize loop is sketched after this entry). The method is evaluated on common synthetic and real-world datasets against other state-of-the-art SLAM systems, and the resulting 3D scene representation can be rendered efficiently in real time using Gaussian splatting.
3D Modeling
47.7K
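
The track-grow-optimize structure common to Gaussian-based RGBD SLAM can be written as a simple per-frame loop. All functions below are placeholders for the corresponding stages; the sketch conveys only the flow, not this paper's actual system.

```python
import torch

def track_camera(frame, gaussians, prev_pose):
    """Placeholder: estimate the current camera pose against the existing map."""
    return prev_pose.clone()

def grow_gaussians(gaussians, frame, pose):
    """Placeholder: seed new Gaussians where the new RGBD frame sees unmapped space."""
    new = {"means": torch.randn(500, 3), "colors": torch.rand(500, 3)}
    return {k: torch.cat([gaussians[k], new[k]]) for k in gaussians}

def optimize_map(gaussians, frame, pose, iters: int = 10):
    """Placeholder: refine Gaussian parameters against the frame's color and depth."""
    return gaussians

# Per-frame SLAM loop over an RGBD stream (random tensors as stand-in frames).
gaussians = {"means": torch.randn(1000, 3), "colors": torch.rand(1000, 3)}
pose = torch.eye(4)
for frame in [torch.rand(4, 480, 640) for _ in range(5)]:   # RGB-D: 4 channels
    pose = track_camera(frame, gaussians, pose)
    gaussians = grow_gaussians(gaussians, frame, pose)
    gaussians = optimize_map(gaussians, frame, pose)
print(gaussians["means"].shape)                              # grows by 500 per frame
```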

ReconFusion
ReconFusion is a 3D reconstruction method that leverages a diffusion prior to reconstruct real-world scenes from a limited number of photographs. It combines Neural Radiance Fields (NeRF) with a diffusion prior, enabling the synthesis of realistic geometry and texture at camera poses beyond the input image set (the prior-regularized optimization loop is sketched after this entry). The diffusion prior is trained on both limited-view and multi-view datasets, allowing the method to synthesize realistic geometry and texture in under-constrained regions while preserving the appearance of observed regions. ReconFusion has been extensively evaluated on various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements.
AI image generation
60.2K
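
The optimization described above alternates a reconstruction loss on the observed views with a diffusion-prior loss on renders at novel poses. The sketch below shows that loop with placeholder render and prior functions; the names, loss weights, and prior loss form are illustrative assumptions, not ReconFusion's code.

```python
import torch

# A trivially small "radiance field": one learnable tensor standing in for NeRF weights.
field = torch.nn.Parameter(torch.randn(3, 32, 32))
optimizer = torch.optim.Adam([field], lr=1e-2)

observed = [torch.rand(3, 32, 32) for _ in range(3)]        # the few input photos

def render(field: torch.Tensor, pose_id: int) -> torch.Tensor:
    """Placeholder renderer: a real NeRF would volume-render the field at this pose."""
    return torch.roll(field, shifts=pose_id, dims=-1)

def diffusion_prior_loss(image: torch.Tensor) -> torch.Tensor:
    """Placeholder for the diffusion prior's score on a novel-view render;
    here just a smoothness penalty so the example runs."""
    return (image[..., 1:] - image[..., :-1]).abs().mean()

for step in range(200):
    recon = sum(((render(field, i) - img) ** 2).mean()        # fit the observed views
                for i, img in enumerate(observed))
    novel = diffusion_prior_loss(render(field, pose_id=7))    # regularize a novel pose
    loss = recon + 0.1 * novel
    optimizer.zero_grad(); loss.backward(); optimizer.step()
print(f"final loss: {loss.item():.4f}")
```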

NVAS3D
NVAS3D is a project for estimating the sound at any location within a scene containing multiple unknown sound sources. It achieves novel-view acoustic synthesis using audio recordings from multiple microphones together with the 3D geometry and materials of the scene.
AI Audio Enhancer
48.9K

PanoHead
PanoHead is a geometry-aware 360° 3D full-head synthesis method. It can be trained solely on unstructured in-the-wild images and achieves consistent 360° full-head image synthesis with diverse appearances and detailed geometry.
AI image generation
71.2K

Chupa
Chupa is a 3D human generation pipeline that combines the generative power of diffusion models with neural rendering techniques to create diverse and realistic 3D human figures. The pipeline generalizes easily to unseen human poses and produces lifelike results. Starting from SMPL-X meshes, Chupa generates a variety of high-quality 3D human meshes in latent space.
AI image generation
50.5K

Neuralangelo By NVIDIA
Neuralangelo is an AI model developed by NVIDIA Research that uses neural networks for 3D reconstruction. It converts 2D video clips into detailed 3D structures, generating realistic virtual buildings, sculptures, and other objects, and accurately captures the textures of complex materials such as roof tiles, window glass, and polished marble. Creative professionals can import these 3D objects, ranging from small sculptures to large buildings, into design applications for further editing and use them in art, video game development, robotics, and industrial digital twins, re-creating the real world in the digital realm.
AI 3D tools
68.2K
# Featured AI Tools

Flow AI
Flow is an AI-driven movie-making tool designed for creators, using Google DeepMind's advanced models to let users easily create polished movie clips, scenes, and stories. The tool provides a seamless creative experience, supporting both user-provided assets and content generated within Flow. In terms of pricing, the Google AI Pro and Google AI Ultra plans offer different functionality for different user needs.
Video Production
42.8K

NoCode
NoCode is a platform that requires no programming experience, allowing users to quickly generate applications by describing their ideas in natural language, aiming to lower development barriers so more people can realize their ideas. The platform provides real-time previews and one-click deployment features, making it very suitable for non-technical users to turn their ideas into reality.
Development Platform
44.7K

ListenHub
ListenHub is a lightweight AI podcast generation tool that supports both Chinese and English. Based on cutting-edge AI technology, it can quickly generate podcast content of interest to users. Its main advantages include natural dialogue and ultra-realistic voice effects, allowing users to enjoy high-quality auditory experiences anytime and anywhere. ListenHub not only improves the speed of content generation but also offers compatibility with mobile devices, making it convenient for users to use in different settings. The product is positioned as an efficient information acquisition tool, suitable for the needs of a wide range of listeners.
AI
42.5K

MiniMax Agent
MiniMax Agent is an intelligent AI companion built on the latest multimodal technology. Through MCP-based multi-agent collaboration, AI teams can efficiently solve complex problems. It provides features such as instant answers, visual analysis, and voice interaction, and can increase productivity tenfold.
Multimodal technology
43.1K

Tencent Hunyuan Image 2.0
Tencent Hunyuan Image 2.0 is Tencent's latest AI image generation model, with significantly improved generation speed and image quality. With an ultra-high-compression-ratio codec and a new diffusion architecture, image generation reaches millisecond-level speed, avoiding the waiting time of traditional generation. At the same time, reinforcement learning combined with human aesthetic knowledge improves the realism and detail of the images, making the model suitable for professional users such as designers and creators.
Image Generation
42.2K

OpenMemory MCP
OpenMemory is an open-source personal memory layer that provides private, portable memory management for large language models (LLMs). It ensures users have full control over their data, maintaining its security when building AI applications. This project supports Docker, Python, and Node.js, making it suitable for developers seeking personalized AI experiences. OpenMemory is particularly suited for users who wish to use AI without revealing personal information.
open source
42.8K

FastVLM
FastVLM is an efficient visual encoding model designed specifically for visual language models. It uses the innovative FastViTHD hybrid visual encoder to reduce the time required for encoding high-resolution images and the number of output tokens, resulting in excellent performance in both speed and accuracy. FastVLM is primarily positioned to provide developers with powerful visual language processing capabilities, applicable to various scenarios, particularly performing excellently on mobile devices that require rapid response.
Image Processing
41.4K

LiblibAI
LiblibAI is a leading Chinese AI creative platform offering powerful AI creative tools to help creators bring their imagination to life. The platform provides a vast library of free AI creative models, allowing users to search and utilize these models for image, text, and audio creations. Users can also train their own AI models on the platform. Focused on the diverse needs of creators, LiblibAI is committed to creating inclusive conditions and serving the creative industry, ensuring that everyone can enjoy the joy of creation.
AI Model
6.9M